# Initialize Otter
import otter
grader = otter.Notebook("hw2-seda.ipynb")
import numpy as np
import pandas as pd
import altair as alt
# disable row limit for plotting
alt.data_transformers.disable_max_rows()
# uncomment to ensure graphics display with pdf export
# alt.renderers.enable('mimetype')
DataTransformerRegistry.enable('default')

Background

Gender achievement gaps in education have been well-documented over the years – studies consistently find boys outperforming girls on math tests and girls outperforming boys on reading and language tests. A particularly controversial article was published in Science in 1980 arguing that this pattern was due to an ‘innate’ difference in ability (focusing on mathematics rather than on reading and language). Such views persisted in part because studying systematic patterns in achievement nationwide was a challenge due to differential testing standards across school districts and the general lack of availability of large-scale data.

It is only recently that data-driven research has begun to reveal socioeconomic drivers of achievement gaps. The Standford Educational Data Archive (SEDA), a publicly available database on academic achievement and educational opportunity in U.S. schools, has supported this effort. The database is part of a broader initiave aiming to improve educational opportunity by enabling researchers and policymakers to identify systemic drivers of disparity.

SEDA includes a range of detailed data on educational conditions, contexts, and outcomes in school districts and counties across the United States. It includes measures of academic achievement and achievement gaps for school districts and counties, as well as district-level measures of racial and socioeconomic composition, racial and socioeconomic segregation patterns, and other features of the schooling system.

The database standardizes average test scores for schools 10,000 U.S. school districts relative to national standards to allow comparability between school districts and across grade levels and years. The test score data come from the U.S. Department of Education. In addition, multiple data sources (American Community Survey and Common Core of Data) are integrated to provide district-level socioeconomic and demographic information.

A study of the SEDA data published in 2018 identified the following persistent patterns across grade levels 3 - 8 and school ears from 2008 through 2015: * a consistent reading and language achievement gap favoring girls; * no national math achievement gap on average; and * local math achievement gaps that depend on the socioeconomic conditions of school districts. You can read about the main findings of the study in this brief NY Times article.

Below, we’ll work with selected portions of the database. The full datasets can be downloaded here.

Assignment objectives

In this assignment, you’ll explore achievement gaps in California school districts in 2018, reproducing the findings described in the article above on a more local scale and with the most recent SEDA data. You’ll practice the following:

  • review of data documentation
  • assessment of sampling design and scope of inference
  • data tidying operations
    • slicing and filtering
    • merging multiple data frames
    • pivoting tables
    • renaming and reordering variables
  • constructing exploratory graphics and visualizing trends
  • data aggregations
  • narrative summary of exploratory analysis

Import and assessment of datasets

You’ll work with test data and socioeconomic covariates aggregated to the school district level. These data are stored in two separate tables. Here you’ll examine them and review data documentation.

Test score data

The first few rows of the test data are shown below. The columns are:

Column name Meaning
sedalea District ID
grade Grade level
stateabb State abbreviation
sedaleaname District name
subject Test subject
cs_mn_... Estimated mean test score
cs_mnse_... Standard error for estimated mean test score
totgyb_... Number of individual tests used to estimate the mean score
# import seda data
ca_main = pd.read_csv('data/ca-main.csv')
ca_cov = pd.read_csv('data/ca-cov.csv')

# preview test score data
ca_main.head(3)
sedalea grade stateabb sedaleaname subject cs_mn_all cs_mnse_all totgyb_all cs_mn_asn cs_mnse_asn ... totgyb_whg cs_mn_wht cs_mnse_wht totgyb_wht cs_mn_wmg cs_mnse_wmg totgyb_wmg cs_mn_wng cs_mnse_wng totgyb_wng
0 600001 4 CA ACTON-AGUA DULCE UNIFIED ... mth -0.367007 0.108543 86.0 NaN NaN ... 79.0 -0.208654 0.165783 35.0 -0.089003 0.518066 38.0 NaN NaN NaN
1 600001 4 CA ACTON-AGUA DULCE UNIFIED ... rla 0.005685 0.117471 85.0 NaN NaN ... 78.0 0.259587 0.189614 35.0 0.526942 0.602989 38.0 NaN NaN NaN
2 600001 6 CA ACTON-AGUA DULCE UNIFIED ... rla -0.000040 0.092172 114.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

3 rows × 59 columns

The test score means for each district are named cs_mn_... with an abbreviation indicating subgroup (such as mean score for all cs_mean_all, for boys cs_mean_mal, for white students cs_mn_wht, and so on). Notice that these are generally small-ish: decimal numbers between -0.5 and 0.5.

These means are estimated from a number of individual student tests and standardized relative to national averages. They represent the number of standard deviations by which a district mean differs from the national average. So, for instance, the value cs_mn_all = 0.1 indicates that the district average is estimated to be 0.1 standard deviations greater than the national average on the corresponding test and at the corresponding grade level.

Question 1: Interpreting test score values

Interpret the average math test score for all 4th grade students in Acton-Agua Dulce Unified School District (the first row of the dataset shown above).

Type your answer here, replacing this text.

SOLUTION In Acton-Agua Dulce Unified School District, the mean math test score for fourth graders in 2018 is estimated to be 0.37 standard deviations below the national average.

Covariate data

The first few rows of the covariate data are shown below. The column information is as follows:

Column name Meaning
sedalea District ID
grade Grade level
sedaleanm District name
urban Indicator: is the district in an urban locale?
suburb Indicator: is the district in a suburban locale?
town Indicator: is the district in a town locale?
rural Indicator: is the district in a rural locale?
locale Description of district locale
Remaining variables Demographic and socioeconomic measures
ca_cov.head(3)
sedalea grade sedaleanm urban suburb town rural locale perind perasn ... snapall snapblk snaphsp snapwht single_momall single_momblk single_momhsp single_momwht seswhtblk seswhthsp
0 600001 4.0 ACTON-AGUA DULCE UNIFIED ... 0.0 0.0 0.0 1.0 Rural, Distant 0.003893 0.045901 ... 0.035165 0.20293 0.0819 0.032362 0.084385 0.349636 0.198482 0.061653 1.839339 0.692566
1 600001 5.0 ACTON-AGUA DULCE UNIFIED ... 0.0 0.0 0.0 1.0 Rural, Distant 0.003788 0.046652 ... 0.035165 0.20293 0.0819 0.032362 0.084385 0.349636 0.198482 0.061653 1.839339 0.692566
2 600001 6.0 ACTON-AGUA DULCE UNIFIED ... 0.0 0.0 0.0 1.0 Rural, Distant 0.003218 0.043657 ... 0.035165 0.20293 0.0819 0.032362 0.084385 0.349636 0.198482 0.061653 1.839339 0.692566

3 rows × 60 columns

You will only be working with a handful of the demographic and socioeconomic measures, so you can put off getting acquainted with those until selecting a subset of variables.

Question 2: Data semantics

In the non-public data, observational units are students – test scores are measured for each student. However, in the SEDA data you’ve imported, scores are aggregated to the district level by grade. Let’s regard estimated test score means for each grade as distinct variables, so that an observation consists in a set of estimated means for different grade levels and groups. In this view, what are the observational units in the test score dataset? Are they the same or different for the covariate dataset?

Type your answer here, replacing this text.

SOLUTION The observational units are school districts. They are the same for each dataset.

Question 3: Sample sizes

How many observational units are in each dataset? Count the number of units in the test dataset and the number of units in the covariate dataset separately. Store the values as ca_cov_units and ca_main_units, respectively.

(Hint: use .nunique().)

ca_cov_units = ca_cov.sedalea.nunique() # SOLUTION
ca_main_units =ca_main.sedalea.nunique() # SOLUTION

print('units in covariate data: ', ca_cov_units)
print('units in test score data: ', ca_main_units)
units in covariate data:  913
units in test score data:  872
grader.check("q3")

Question 4: Sample characteristics and scope of inference

Answer the questions below about the sampling design in a short paragraph. You do not need to dig through any data documentation in order to resolve these questions.

    1. What is the relevant population for the datasets you’ve imported?
    1. About what proportion (to within 0.1) of the population is captured in the sample? (Hint: have a look at this website.)
    1. Considering that the sampling frame is not identified clearly, what kind of dataset do you suspect this is (e.g., administrative, data from a ‘typical sample’, census, etc.)?
    1. In light of your description of the sample characteristics, what is the scope of inference for this dataset?

Type your answer here, replacing this text.

SOLUTION

The population of interest is all school districts in California in 2018. There are ~1000 school districts in total in 2022-2023, and although districts change a little from year to year, the total number doesn’t change too much in adjacent years. So this sample from 2018 likely covers about 90% of the population. The dataset is best described as administrative data: it is not a census, because the entire population is not included; and it is not a ‘typical’ sample or a random sample because there is no random selection mechanism from a well-defined frame. This data does not support inference and should be used for summary purposes only – conclusions should not be generalized beyond the sample. However, this actually isn’t much of a limitation here because the sample covers so much of the population.

Data tidying

Since you’ve already had some guided practice doing this in previous assignments, you’ll be left to fill in a little bit more of the details on your own in this assignment. You’ll work with the following variables from each dataset:

  • Test score data
    • District ID
    • District name
    • Grade
    • Test subject
    • Estimated male-female gap
  • Covariate data
    • District ID
    • Locale
    • Grade
    • Socioeconomic status (all demographic groups)
    • Log median income (all demographic groups)
    • Poverty rate (all demographic groups)
    • Unemployment rate (all demographic groups)
    • SNAP benefit receipt rate (all demographic groups)

Question 5: Variable names of interest

Download the codebooks by opening the ‘data’ directory from your Jupyter Lab file navigator and downloading the codebook files. Identify the variables listed above, and store the column names in lists named main_vars and cov_vars.

# store variable names of interest
main_vars = ['sedalea', 'grade', 'sedaleaname', 'subject', 'cs_mn_mfg'] # SOLUTION
cov_vars = ['sedalea', 'locale', 'grade', 'sesall', 'lninc50all', 'povertyall', 'unempall', 'snapall'] # SOLUTION
grader.check("q5")

Question 6: Slice columns

Use your result from above to slice the columns of interest from the covariate and test score data. Store the resulting data frames as main_sub and cov_sub (for ‘subset’).

# slice columns to select variables of interest
main_sub = ca_main.loc[:, main_vars] # SOLUTION
cov_sub = ca_cov.loc[:, cov_vars] # SOLUTION
grader.check("q6")

In the next step you’ll merge the covariate data with the test score data. In order to do this, you can use the pd.merge(A, B, how = ..., on = SHARED_COLS) function, which will match the rows of A and B based on the shared columns SHARED_COLS. If how = 'left', then only rows in A will be retained in the output (so B will be merged to A); conversely, if how = 'right', then only rows in B will be retained in the output (so A will be merged to B).

A simple example of the use of pd.merge is illustrated below:

# toy data frames
A = pd.DataFrame(
    {'shared_col': ['a', 'b', 'c'], 
    'x1': [1, 2, 3], 
    'x2': [4, 5, 6]}
)

B = pd.DataFrame(
    {'shared_col': ['a', 'b'], 
    'y1': [7, 8]}
)
A
shared_col x1 x2
0 a 1 4
1 b 2 5
2 c 3 6
B
shared_col y1
0 a 7
1 b 8

Below, if A and B are merged retaining the rows in A, notice that a missing value is input because B has no row where the shared column (on which the merging is done) has value c. In other words, the third row of A has no match in B.

# left join
pd.merge(A, B, how = 'left', on = 'shared_col')
shared_col x1 x2 y1
0 a 1 4 7.0
1 b 2 5 8.0
2 c 3 6 NaN

If the direction of merging is reversed, and the row structure of B is dominant, then the third row of A is dropped altogether because it has no match in B.

# right join
pd.merge(A, B, how = 'right', on = 'shared_col')
shared_col x1 x2 y1
0 a 1 4 7
1 b 2 5 8

Question 7: Merge

Merge the covariate and test score data on both the district ID and grade level columns, and retain only the columns from the test score data (meaning, merge the covariate data to the test score data). Store the resulting data frame as rawdata and print the first four rows.

# merge covariates with gap data
rawdata = pd.merge(cov_sub, main_sub, how = 'right', on = ['sedalea', 'grade']) # SOLUTION

# print first four rows
rawdata.head(4) # SOLUTION
sedalea locale grade sesall lninc50all povertyall unempall snapall sedaleaname subject cs_mn_mfg
0 600001 Rural, Distant 4.0 1.237209 11.392048 0.091894 0.048886 0.035165 ACTON-AGUA DULCE UNIFIED ... mth NaN
1 600001 Rural, Distant 4.0 1.237209 11.392048 0.091894 0.048886 0.035165 ACTON-AGUA DULCE UNIFIED ... rla NaN
2 600001 Rural, Distant 6.0 1.237209 11.392048 0.091894 0.048886 0.035165 ACTON-AGUA DULCE UNIFIED ... rla NaN
3 600001 Rural, Distant 8.0 1.237209 11.392048 0.091894 0.048886 0.035165 ACTON-AGUA DULCE UNIFIED ... mth -0.562855
grader.check("q7")

Question 8: Rename and reorder columns

Now rename and rearrange the columns of rawdata so that they appear in the following order and with the following names:

  • District ID, District, Locale, log(Median income), Poverty rate, Unemployment rate, SNAP rate, Socioeconomic index, Grade, Subject, Gender gap

Store the resulting data frame as rawdata_mod1 and print the first four rows.

(Hint: first define a dictionary to map the old names to the new ones; then create a list of the new names specified in the desired order; then use .rename() and .loc[]. You can follow the renaming steps in HW1 as an example if needed.)

# define dictionary mapping for renaming columns
# BEGIN SOLUTION
name_dict = {
    'sedalea': 'District ID',
    'sedaleaname': 'District',
    'sesall': 'Socioeconomic index',
    'lninc50all': 'log(Median income)',
    'povertyall': 'Poverty rate',
    'unempall': 'Unemployment rate',
    'snapall': 'SNAP rate',
    'cs_mn_mfg': 'Gender gap',
    'subject': 'Subject',
    'locale': 'Locale',
    'grade': 'Grade'
}
# END SOLUTION

# specify order of columns
# BEGIN SOLUTION
name_order = ['District ID', 'District', 'Locale',  
                   'log(Median income)', 'Poverty rate', 'Unemployment rate', 'SNAP rate', 
                   'Socioeconomic index', 'Grade', 'Subject', 'Gender gap']

# END SOLUTION

# rename and reorder
# BEGIN SOLUTION
rawdata_mod1 = rawdata.rename(
    columns = name_dict
).loc[:, name_order]
# END SOLUTION

# print first four rows
rawdata_mod1.head(4) # SOLUTION
District ID District Locale log(Median income) Poverty rate Unemployment rate SNAP rate Socioeconomic index Grade Subject Gender gap
0 600001 ACTON-AGUA DULCE UNIFIED ... Rural, Distant 11.392048 0.091894 0.048886 0.035165 1.237209 4.0 mth NaN
1 600001 ACTON-AGUA DULCE UNIFIED ... Rural, Distant 11.392048 0.091894 0.048886 0.035165 1.237209 4.0 rla NaN
2 600001 ACTON-AGUA DULCE UNIFIED ... Rural, Distant 11.392048 0.091894 0.048886 0.035165 1.237209 6.0 rla NaN
3 600001 ACTON-AGUA DULCE UNIFIED ... Rural, Distant 11.392048 0.091894 0.048886 0.035165 1.237209 8.0 mth -0.562855
grader.check("q8")

Question 9: Pivot

Notice that the Gender gap column contains the values of two variables: the gap in estimated mean test scores for math tests, and the gap in estimated mean test scores for reading and language tests. To put the data in tidy format, use .pivot and .rename() to pivot the table so that the gender gap column is spread into two columns named Math gap and Reading gap. Store the result as seda_data and print the first four rows.

Hint: to avoid unweildy column indexing, make sure you specify a values = ... argument when using .pivot(). Doing so will result in the column index being named Subject; remove this name in your solution.

Aside: an alternative solution is to manipulate the indices and use .unstack(), but this method will produce a dataframe with hierarchical column indexing (you’ll see) in which Subject is retained as a lower-level index; this will need to be collapsed in order to rename the columns as instructed using MultiIndex.droplevel() or similar.

# pivot to unstack gender gap (fixing tidy issue: multiple variables in one column)
# BEGIN SOLUTION
seda_data = rawdata_mod1.pivot(
    index = name_order[0:9], 
    columns = 'Subject',
    values = 'Gender gap'
    ).rename(
    columns = {'mth': 'Math gap', 'rla': 'Reading gap'}
    ).reset_index()
seda_data.columns.name = None

# alternative solution using index manipulations
# seda_data = rawdata_mod1.set_index(name_order[0:10]).unstack(-1).reset_index() # option 1
# idx = seda_data.columns.droplevel(1)
# idx.values[9:11] = ["Math gap", "Reading gap"]
# seda_data.columns = idx
# END SOLUTION

# print first four rows
seda_data.head(4) # SOLUTION
District ID District Locale log(Median income) Poverty rate Unemployment rate SNAP rate Socioeconomic index Grade Math gap Reading gap
0 600001 ACTON-AGUA DULCE UNIFIED ... Rural, Distant 11.392048 0.091894 0.048886 0.035165 1.237209 4.0 NaN NaN
1 600001 ACTON-AGUA DULCE UNIFIED ... Rural, Distant 11.392048 0.091894 0.048886 0.035165 1.237209 6.0 NaN NaN
2 600001 ACTON-AGUA DULCE UNIFIED ... Rural, Distant 11.392048 0.091894 0.048886 0.035165 1.237209 8.0 -0.562855 -0.785321
3 600006 ROSS VALLEY ELEMENTARY ... Suburb, Large 11.607236 0.041418 0.048269 0.028006 1.912972 4.0 -0.025131 -0.521408
grader.check("q9")

Your final dataset should match the dataframe below. You can use this to check your answer and revise any portions above that lead to different results.

# intended result
data_reference = pd.read_csv('data/tidy-seda-check.csv')
data_reference
District ID District Locale log(Median income) Poverty rate Unemployment rate SNAP rate Socioeconomic index Grade Math gap Reading gap
0 600001 ACTON-AGUA DULCE UNIFIED ... Rural, Distant 11.392048 0.091894 0.048886 0.035165 1.237209 4.0 NaN NaN
1 600001 ACTON-AGUA DULCE UNIFIED ... Rural, Distant 11.392048 0.091894 0.048886 0.035165 1.237209 6.0 NaN NaN
2 600001 ACTON-AGUA DULCE UNIFIED ... Rural, Distant 11.392048 0.091894 0.048886 0.035165 1.237209 8.0 -0.562855 -0.785321
3 600006 ROSS VALLEY ELEMENTARY ... Suburb, Large 11.607236 0.041418 0.048269 0.028006 1.912972 4.0 -0.025131 -0.521408
4 600006 ROSS VALLEY ELEMENTARY ... Suburb, Large 11.607236 0.041418 0.048269 0.028006 1.912972 5.0 0.143163 -0.210279
5 600006 ROSS VALLEY ELEMENTARY ... Suburb, Large 11.607236 0.041418 0.048269 0.028006 1.912972 6.0 0.065459 -0.034993
6 600006 ROSS VALLEY ELEMENTARY ... Suburb, Large 11.607236 0.041418 0.048269 0.028006 1.912972 7.0 NaN -0.203610
7 600011 FORT SAGE UNIFIED ... Rural, Distant 10.704570 0.159981 0.066333 0.102054 -0.478127 4.0 -0.015417 -0.191400
8 600011 FORT SAGE UNIFIED ... Rural, Distant 10.704570 0.159981 0.066333 0.102054 -0.478127 5.0 NaN NaN
9 600011 FORT SAGE UNIFIED ... Rural, Distant 10.704570 0.159981 0.066333 0.102054 -0.478127 7.0 NaN NaN

Question 10: Sanity check

Ensure that your tidying did not inadvertently drop any observations: count the number of units in seda_data. Does this match the number of units represented in the original test score data ca_main? Store these values as data_units and ca_main_units, respectively.

(Hint: use .nunique().)

# number of districts in tidied data compared with raw
data_units = seda_data['District ID'].nunique() # SOLUTION
ca_main_units = ca_main.sedalea.nunique() # SOLUTION
grader.check("q10")

Question 11: Missing values

Gap estimates were not calculated for certain grades in certain districts due to small sample sizes (not enough individual tests recorded). Answer the following:

    1. What proportion of rows are missing for each of the reading and math gap variables? Store these values as math_missing and reading_missing, respectively.
    1. What proportion of districts (not rows!) have missing gap estimates for one or both test subjects for at least one grade level?Store the value as district_missing.
# proportion of missing values
math_missing, reading_missing = seda_data.loc[:, seda_data.columns.str.endswith('gap')].isna().mean() # SOLUTION
# proportion of districts with missing values

# BEGIN SOLUTION
district_missing = np.mean(
    seda_data.set_index(
        'District ID'
    ).loc[:, ['Math gap', 'Reading gap']].mean(
        axis = 1, 
        skipna = False
    ).isna().reset_index().groupby(
        'District ID'
    ).sum() > 0, axis = 0).values[0]
# END SOLUTION
grader.check("q11")

Question 12: Missing mechanism

Do you expect that this missingness is more likely for some districts than for others? If so, explain; why is this, and is bias a concern if missing values are dropped?

Type your answer here, replacing this text.

SOLUTION Since missing estimates arise from small sample sizes, missing values are more likely for districts with lower enrollment. Dropping these values would bias results towards larger districts, however, since the sample is nonrandom anyway, this is not a formal concern over and above existing inferential limitations.

Exploratory graphics

For the purpose of visualizing the relationship between estimated gender gaps and socioeconomic variables, you’ll find it more helpful to store a non-tidy version of the data. The cell below rearranges the dataset so that one column contains an estimated gap, one column contains the value of a socioeconomic variable, and the remaining columns record the gap type and variable identity.

Ensure that your results above match the reference dataset before running this cell.

# format data for plotting
plot_df = seda_data.melt(
    id_vars = name_order[0:9],
    value_vars = ['Math gap', 'Reading gap'],
    var_name = 'Gap type',
    value_name = 'Gap'
).melt(
    id_vars = ['District ID', 'District', 'Locale', 'Gap type', 'Gap', 'Grade'],
    value_vars = name_order[3:8],
    var_name = 'Socioeconomic variable',
    value_name = 'Measure'
)

# preview
plot_df.head()
District ID District Locale Gap type Gap Grade Socioeconomic variable Measure
0 600001 ACTON-AGUA DULCE UNIFIED ... Rural, Distant Math gap NaN 4.0 log(Median income) 11.392048
1 600001 ACTON-AGUA DULCE UNIFIED ... Rural, Distant Math gap NaN 6.0 log(Median income) 11.392048
2 600001 ACTON-AGUA DULCE UNIFIED ... Rural, Distant Math gap -0.562855 8.0 log(Median income) 11.392048
3 600006 ROSS VALLEY ELEMENTARY ... Suburb, Large Math gap -0.025131 4.0 log(Median income) 11.607236
4 600006 ROSS VALLEY ELEMENTARY ... Suburb, Large Math gap 0.143163 5.0 log(Median income) 11.607236

Gender gaps and socioeconomic factors

The cell below generates a panel of scatterplots showing the relationship between estimated gender gap and socioeconomic factors for all grade levels by test subject. The plot suggests that the reading gap favors girls consistently across the socioeconomic spectrum – in a typical district girls seem to outperform boys by 0.25 standard deviations of the national average. By contrast, the math gap appears to depend on socioeconomic factors – boys only seem to outperform girls under better socioeconomic conditions.

# plot gap against socioeconomic variables by subject for all grades
fig1 = alt.Chart(plot_df).mark_circle(opacity = 0.1).encode(
    y = 'Gap',
    x = alt.X('Measure', scale = alt.Scale(zero = False), title = ''),
    color = 'Gap type'
).properties(
    width = 100,
    height = 100
).facet(
    column = alt.Column('Socioeconomic variable')
).resolve_scale(x = 'independent')

fig1

Question 13: Relationships by grade level

Does the pattern shown in the plot above persist within each grade level? Modify the plot above to show these relationships by grade level: generate a panel of scatterplots of gap against socioeconomic measures by subject, where each column of the panel corresponds to one socioeconomic variable and each row corresponds to one grade level; the result should by a 5x5 panel. Resize the width and height of each facet so that the panel is of reasonable size. Keep a fixed axis scale for the variable of interest, but allow the axis scales for socioeconomic variables to vary independently. Store the plot as fig2; display the figure and provide an answer to the question of interest in the text cell.

(Hint: you may find it useful to have a look at the altair documentation on compound charts, and lab 3, for examples to follow.)

Type your answer here, replacing this text.

# plotting codes here
# BEGIN SOLUTION
fig2 = alt.Chart(plot_df).mark_circle(opacity = 0.2).encode(
    y = 'Gap',
    x = alt.X('Measure', scale = alt.Scale(zero = False), title = ''),
    color = 'Gap type'
).properties(
    width = 75,
    height = 75
).facet(
    column = alt.Column('Socioeconomic variable'),
    row = alt.Row('Grade')
).resolve_scale(x = 'independent')

# END SOLUTION

# display
fig2 # SOLUTION

SOLUTION Yes, the patterns observed across all grade levels are similar to the patterns within grade level.

Question 14: Association with grade level

Do gaps shift across grade levels? It’s not so easy to tell from the last figure. Construct a 2x5 panel of scatterplots showing estimated achievement gap against each of the 5 socioeconomic variables, with one row per test subject. Display grade level using a color gradient. Store the plot as fig3; display the figure and answer the question of interest in a short sentence or two in the text cell provided.

Type your answer here, replacing this text.

# plotting codes here
# BEGIN SOLUTION
fig3 = alt.Chart(plot_df).mark_circle(opacity = 0.2).encode(
    y = 'Gap',
    x = alt.X('Measure', scale = alt.Scale(zero = False), title = ''),
    color = 'Grade'
).properties(
    width = 75,
    height = 75
).facet(
    column = alt.Column('Socioeconomic variable'),
    row = alt.Row('Gap type')
).resolve_scale(x = 'independent')

# END SOLUTION

# display
fig3 # SOLUTION

SOLUTION Yes – the scatter shifts from dark to light as the estimated gap decreases for both subjects and all socioeconomic variables, indicating that as grade level increases, the gap increasingly favors girls in both math and reading and language.

While the magnitude of the achievement gaps seems to depend very slightly on grade level (figure 3), the form of relationship between achievement gap and socioeconomic factors does not differ from grade to grade (figure 2).

Given that the relationships between achievement gaps and socioeconomic factors don’t change drastically across grade levels, it is reasonable to look at the average relationship between estimated achievement gap and median income after aggregating across grade.

Question 15: Aggregation across grade levels

Compute the mean estimated achievement gap in each subject across grade levels by district using District ID and retain the district-level socioeconomic variables. Store the resulting data frame as seda_data_agg.

Note: best practice here would be to aggregate just the test scores by district and then re-merge the result with the district-level socioeconomic variables. However, since the district-level socioeconomic variables do not differ by grade within a district, averaging them across grade levels by district together with the test scores will simply return their unique values; so the aggregation can be applied across all columns for a fast-and-loose way to obtain the desired result.

# aggregate across grades
# BEGIN SOLUTION
seda_data_agg = seda_data.groupby(
    ['District ID']
    ).mean(
    numeric_only = True
    ).reset_index(
    ).drop(columns = 'Grade')
# END SOLUTION

# print first few rows
seda_data_agg.head() # SOLUTION
District ID log(Median income) Poverty rate Unemployment rate SNAP rate Socioeconomic index Math gap Reading gap
0 600001 11.392048 0.091894 0.048886 0.035165 1.237209 -0.562855 -0.785321
1 600006 11.607236 0.041418 0.048269 0.028006 1.912972 0.061163 -0.242572
2 600011 10.704570 0.159981 0.066333 0.102054 -0.478127 -0.015417 -0.191400
3 600012 10.589787 0.179102 0.059158 0.074903 -0.096379 NaN NaN
4 600013 11.399662 0.060338 0.045533 0.035016 1.398133 0.054454 -0.312638
grader.check("q15")

Question 16: Melt aggregated data for plotting

Similar to working with the disaggregated data, it will be helpful for plotting to melt the two gap variables into a single column. Follow the example above at the beginning of this section to melt only the test score gap columns (not the district-level variables – we will not create scatterplot panels as before). Name the new columns Subject and Average estimated gap; store the resulting data frame as agg_plot_df and print the first four rows.

# format for plotting
# BEGIN SOLUTION
agg_plot_df = seda_data_agg.melt(
    id_vars = seda_data_agg.columns[0:6],
    value_vars = ['Math gap', 'Reading gap'],
    var_name = 'Subject',
    value_name = 'Average estimated gap'
)
# END SOLUTION

# print four rows
agg_plot_df.head(4) # SOLUTION
District ID log(Median income) Poverty rate Unemployment rate SNAP rate Socioeconomic index Subject Average estimated gap
0 600001 11.392048 0.091894 0.048886 0.035165 1.237209 Math gap -0.562855
1 600006 11.607236 0.041418 0.048269 0.028006 1.912972 Math gap 0.061163
2 600011 10.704570 0.159981 0.066333 0.102054 -0.478127 Math gap -0.015417
3 600012 10.589787 0.179102 0.059158 0.074903 -0.096379 Math gap NaN
grader.check("q16")

Question 17: District average gaps

Construct a scatterplot of the average estimated gap against log(Median income) by subject for each district and add trend lines (see lab 4). Store the plot as fig4. Describe and interpret the plot in a few sentences.

Type your answer here, replacing this text.

# scatterplot
# BEGIN SOLUTION
base = alt.Chart(agg_plot_df).mark_point(opacity = 0.5).encode(
    y = 'Average estimated gap',
    x = alt.X('log(Median income)', scale = alt.Scale(zero = False)),
    color = 'Subject'
)
# END SOLUTION

# trend line
trend = base.transform_regression('log(Median income)', 'Average estimated gap', groupby = ['Subject']).mark_line() # SOLUTION

# combine layers
fig4 = base + trend # SOLUTION

# display
fig4 # SOLUTION

SOLUTION Figure 4 shows average estimated achievement gaps – the difference between boys’ scores and girls’ scores – on math and reading tests across grade levels for 872 school districts in California against the median district income. Linear fits help visualize trends. The reading achievement gap favors girls and appears uncorrelated with median income; the math achievement gap increasingly favors boys in more affluent districts.

Now let’s try to capture this pattern in tabular form. The cell below adds an Income bracket variable by cutting the median income into 8 contiguous intervals using pd.cut(), and tabulates the average socioeconomic measures and estimated gaps across districts by income bracket. Notice that with respect to the gaps, this displays the pattern that is shown visually in the figures above.

seda_data_agg['Income bracket'] = pd.cut(np.e**seda_data_agg['log(Median income)'], 8)
seda_data_agg.groupby('Income bracket').mean().drop(columns = ['District ID', 'log(Median income)'])
Poverty rate Unemployment rate SNAP rate Socioeconomic index Math gap Reading gap
Income bracket
(21980.176, 46455.372] 0.194870 0.072689 0.155061 -0.651999 -0.070284 -0.309743
(46455.372, 70736.321] 0.134078 0.063788 0.095303 0.291085 -0.034061 -0.315545
(70736.321, 95017.269] 0.088713 0.052785 0.048242 1.110433 0.004239 -0.302114
(95017.269, 119298.218] 0.064131 0.046848 0.030548 1.640159 0.050006 -0.287117
(119298.218, 143579.167] 0.050315 0.044343 0.011023 2.167272 0.090138 -0.289529
(143579.167, 167860.115] 0.043896 0.042379 0.008451 2.382258 0.084683 -0.335975
(167860.115, 192141.064] 0.040552 0.040120 0.010159 2.652906 0.175793 -0.232306
(192141.064, 216422.013] 0.047097 0.054055 0.002555 2.588499 0.267301 -0.299798

Question 18: Proportion of districts with a math gap

What proportion of districts in each income bracket have an average estimated math achievement gap favoring boys? Answer this question by performing the following steps:

  • Append an indicator variable Math gap favoring boys to seda_data_agg that records whether the average estimated math gap favors boys by more than 0.1 standard deviations relative to the national average.
  • Compute the proportion of districts in each income bracket for which the indicator is true: group by bracket and take the mean. Store this as income_bracket_boys_favored
# define indicator
seda_data_agg['Math gap favoring boys'] = seda_data_agg['Math gap'] > 0.1 # SOLUTION

# proportion of districts with gap favoring boys, by income bracket
# BEGIN SOLUTION
income_bracket_boys_favored = seda_data_agg.groupby(
    ['Income bracket']
    ).mean(
    ).reset_index(
    ).loc[:, ['Income bracket', 'Math gap favoring boys']] 
# END SOLUTION

# print result
income_bracket_boys_favored # SOLUTION
Income bracket Math gap favoring boys
0 (21980.176, 46455.372] 0.036585
1 (46455.372, 70736.321] 0.061224
2 (70736.321, 95017.269] 0.084337
3 (95017.269, 119298.218] 0.232143
4 (119298.218, 143579.167] 0.388889
5 (143579.167, 167860.115] 0.444444
6 (167860.115, 192141.064] 0.500000
7 (192141.064, 216422.013] 1.000000
grader.check("q18")

Question 19: Statewide averages

To wrap up the exploration, calculate a few statewide averages to get a sense of how some of the patterns above compare with the state as a whole.

    1. Compute the statewide average estimated achievement gaps. Store the result as state_avg.
    1. Compute the proportion of districts in the state with a math gap favoring boys. Store this result as math_boys_proportion
    1. Compute the proportion of districts in the state with a math gap favoring girls. You will need to define a new indicator within seda_data_agg to perform this calculation.
# statewide average
state_avg = seda_data_agg.loc[:, ['Reading gap', 'Math gap']].mean() # SOLUTION

# proportion of districts in the state with a math gap favoring boys
math_boys_proportion = seda_data_agg['Math gap favoring boys'].mean() # SOLUTION

# proportion of districts in the state with a math gap favoring girls
seda_data_agg['Math gap favoring girls'] = seda_data_agg['Math gap'] < -0.1 # SOLUTION
math_girls_proportion = seda_data_agg['Math gap favoring girls'].mean() # SOLUTION
grader.check("q19")

Communicating results

Take a moment to review and reflect on your findings and consider what you have learned from the analysis.

Question 20: Summary

Write a brief summary of your exploratory analysis. What have you discovered about educational achievement gaps in California school districts? Aim to answer in 3-5 sentences or less.

Type your answer here, replacing this text.

SOLUTION: In 872 California school districts in 2018, girls in grades 4 - 8 consistently outperformed boys on reading and language assessments by roughly an estimated 0.3 standard deviations of the national average. This gap persisted regardless of the socioeconomic characteristics of the district. By contrast, boys outperformed girls on math assessments on average only in more affluent school districts; otherwise, boys and girls performed equally or girls performed better (as was the case in 13% of districts). This pattern was robust across grade levels; however, the estimated magnitude of math achievement gaps shifted toward favoring girls more often for higher grade levels.

Submission

  1. Save the notebook.
  2. Restart the kernel and run all cells. (CAUTION: if your notebook is not saved, you will lose your work.)
  3. Carefully look through your notebook and verify that all computations execute correctly. You should see no errors; if there are any errors, make sure to correct them before you submit the notebook.
  4. Download the notebook as an .ipynb file. This is your backup copy.
  5. Export the notebook as PDF and upload to Gradescope.
grader.check_all()